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Abstract 



We consider unconstrained randomized optimization of smooth convex 
objective functions in the gradient-free setting. We analyze Random Pur- 
suit (RP) algorithms with fixed (F-RP) and variable metric (V-RP). The 
algorithms only use zeroth-order information about the objective func- 
tion and compute an approximate solution by repeated optimization over 
randomly chosen one-dimensional subspaces. The distribution of search 
directions is dictated by the chosen metric. Variable Metric RP uses novel 
variants of a randomized zeroth-order Hessian approximation scheme re- 
cently introduced by Leventhal and Lewis (D. Leventhal and A. S. Lewis., 
Optimization 60(3), 329-245, 2011). We here present (i) a refined analysis 
of the expected single step progress of RP algorithms and their global con- 
vergence on (strictly) convex functions and (ii) novel convergence bounds 
for V-RP on convex quadratic functions. We also quantify how well the 
employed metric needs to match the local geometry of the function in 
order for the RP algorithms to converge with the best possible rate on 
strongly convex functions. Our theoretical results are accompanied by 
illustrative experiments on Nesterov's worst case function and quadratic 
functions with (inverse) sigmoidal eigenvalue distributions. 

1 Introduction 

Since its inception by Davidon in the late 1950's variable metric methods 
have become a cornerstone in first-order (non-)convex continuous optimization. 
Among the many instances of variable metric schemes Quasi-Newton methods 
such as the BFGS scheme [TJ [31 HI [T7] are ubiquitous in all areas of science and 
engineering. In zeroth-order (or gradient-free) optimization, the idea of using 
a variable metric guiding the search for local or global optima has surprisingly 
been used to a far less extent. Although "directional adaptation" has been con- 
jectured to be useful for randomized gradient-free schemes in the late 1960's [T5] 
the literature on this topic is scarce and scattered across different communities 
ranging from electrical engineering, optimal control, bio-inspired optimization to 
mathematical programming. Important examples include the Gaussian Adap- 
tation algorithm developed by Kjellstrom and Taxen [TTl [T3] in the context of 
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analog circuit design, Marti's controlled random search schemes using concepts 
from optimal control [13J, and the arguably most popular scheme, Hansen's 
Evolution Strategy with Covariance Matrix Adaptation [5] that emerged in the 
bio-inspired optimization community. 

Despite their great appeal in practice many randomized gradient-free vari- 
able metric schemes lack a thorough theoretical convergence analysis. A marked 
exception is Leventhal and Lewis' recent work on Randomized Hessian approx- 
imation [inj. We here adopt some of their ideas and extend our framework 
of Random Pursuit (RP) [19], eventually leading to Variable Metric Random 
Pursuit (V-RP) schemes. We solely consider optimization problems of the kind: 

min/(x) subject to x G R" , (1) 

where / is a smooth convex function. We assume that there is a global minimum 
and that the curvature of the function / is bounded. Moreover, we assume that 
we have only access to function values of /. No analytic gradient or higher order 
information about / is available. 

To motivate Variable Metric Random Pursuit, let us first sketch the working 
mechanism of standard Random Pursuit on an illustrative example. Each itera- 
tion of standard Random Pursuit consists of two steps: (i) a random direction is 
sampled from an isotropic probability distribution; (ii) the next iterate is chosen 
such as to (approximately) minimize the objective function along this direction 
(using an approximate line search procedure). In |19| we have shown that the 
expected error in function value decreases by a factor of 



in every step, if m > and > are parameters of quadratic functions that 
bound the difference between / and any of its linear approximations from below 
and above; more precisely, 

f ||y - x||^ < Uy) := /(y) - /(x) - (V/(x), y - x) < | ||y - x||^ (2) 

is assumed to hold for all x, y. As an example, let us consider the function 

foixi,X2) = lOOxj + xl, 

for which ^x(y) = 100(a;i — yi)^ + {x2 — 2/2)^. This means that m = 2 and 
ii = 200 are the best possible parameters in ([2]), and the progress rate in every 
step is no better than (1 — 1/200). This also matches our intuition: every level 
set of fo is a long and skinny ellipse, stretching out along the a;2-axis; if we start 
from a point close to the X2 axis, the progress in a step will be small, unless we 
almost sample in X2-direction. 

For this particular function /o, it would be better to sample from an anisotropic 
distribution that favors the X2-direction. Once we fix such an anisotropic sam- 
pling distribution, however, other functions become "bad"; in fact, without prior 
knowledge about /, anisotropic sampling makes no sense at all. Here is where 
the "variable metric" approach comes in. The idea is to gradually adapt the 
sampling distribution to the function / while we run the algorithm. Suppose 
that we can somehow estimate the Hessians at the various iterates. Under the 
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assumption that / is wedged between two quadratic functions — whose Hessians 
are not necessarily multiples of the identity, as in ^ — these estimates will allow 
us to learn a suitable metric that guides the sampling distribution. In case of 
/o, we would start with the isotropic one and then converge to a distribution 
that indeed favors the a;2-direction with the right proportion. 

In this contribution we present a framework for analyzing the convergence 
behavior of Random Pursuit algorithms on convex functions. In a first step we 
analyze the Fixed Metric Random Pursuit (F-RP) algorithm, a natural exten- 
sion of Random Pursuit with an arbitrary but fixed anisotropic sampling distri- 
bution. We obtain a progress rate that depends on the chosen distribution, as 
well as on lower and upper quadratic approximations to /. Our progress rate 
bound improves over the one we would get from the standard Random Pursuit 
analysis |19j . after applying an affine transformation that maps the sampling 
distribution back to the isotropic one. The improvement is due to the fact that 
the new analysis takes the whole spectrum of the quadratic upper bound into 
account rather than just a scalar value £i. 

In a second step we equip Random Pursuit with a randomized scheme to 
update the metric that defines the sampling distribution in every step: the 
Variable Metric Random Pursuit. We present two novel variants of an update 
scheme recently proposed by Leventhal and Lewis [12] . These learning schemes 
are generic in the sense that they work for all convex functions and do not 
require any prior knowledge of the function's shape. We then concentrate on 
the class of convex quadratic functions and rigorously prove in this case that 
the sampling distribution converges to a distribution that yields asymptotically 
optimal (and function-independent) progress rates. We also provide bounds on 
the number of iterations that the algorithm requires in order to attain optimal 
progress rates with high probability. 

The remainder of the paper is structured as follows. In Section [2] we give a 
generic description of the different Random Pursuit algorithms and their essen- 
tial building blocks. We introduce all relevant mathematical definitions such as 
matrix upper and lower bounds of convex functions and expressions for certain 
scalar and matrix expectations in Section [3j We derive the expected single-step 
progress and global convergence of F-RP in Section |4] Section [5] is dedicated to 
Variable Metric Random Pursuit. We derive theoretical convergence results and 
show a number of illustrative numerical examples. We discuss the key results 
of the paper and outline future research goals in Section [6) 

2 Fixed and Variable Metric Random Pursuit 

All Random Pursuit algorithms are designed for problems as defined in ([I]) where 
/ is assumed to be a differentiable convex function with the property that it 
has a minimum along every line in R". Before stating the formal definition of 
the considered RP algorithms we need to define two primitives. 

Definition 1 (Multivariate normal distribution) . The multivariate normal dis- 
tribution arises from independent and identically distributed (i.i.d.) standard 
normals. The vector v € R" is standard multivariate normally distributed, i.e., 
V ^ A/'(0, /„) if Vi ~ A/'(0, 1) for i = 1, . . . ,n and /„ the identity matrix. For 
fi G R", S = CC"'" G PD„, where PD„ is the set of symmetric positive definite 
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matrices, u ^ A/'(/x, S) is multivariate normal with mean fj. and covariance "S 
if u = fi + Cv and v ^ A/'(0, J„). 

Definition 2 (Line search oracle). For x e R", a direction u e R" \ {0} (not 
necessarily of unit length), and a convex function f, a function LS / : K" x R" — >■ 
M with 

LSf (x, u) — aremin f(x + hu) (3) 

■' heR 

is called an exact line search oracle. The analysis in [TH] shows that an 
approximate line search oracle is sufficient to establish the same asymptotic 
convergence bounds which is important in practical applications. For simplicity 
we assume here that we have access to an exact line search oracle. 

The two RP schemes considered here are summarized in Fig. [T] 



F-RP(/,Xo,E,iV) 



Output: Approximate solution Xff 

toQ 

1 for k = 1 to N do 

2 Ufc~Ar(0,S) 

3 ^Xfc ^ LS/(Xfc_i,Ufc) 

4 return x^r 



V-RP(/,Xo,Bo,Af) 



Output: Approximate solution Xff 

tog 

1 for k = 1 to N do 

2 -Bfc ^ updateHess(/, X, _Bfc_i) 

3 Ufc ~AA(0,B^7') 

4 |_Xfc ^ LS/(Xfc_i,U|t) 

5 return xat 



Figure 1: Fixed Metric Random Pursuit (left panel) and the Variable Metric version 
(right panel). The generic sub-routine updateHess on line 2 exemplifies any function 
that generates the metric Bt in step k. Two specific instantiations are discussed in 
Section [5] (cf Fig.[2j. 

In Fixed Metric Random Pursuit (F-RP) a direction u e R"\{0} is sampled 
from a multivariate normal distribution with fixed covariance S at iteration k of 
the algorithm. The next iterate x^ is calculated from the current iterate x^-i 
as 

Xfc :=Xfc_i +LS/(xfe_i,u) -u. (4) 

This algorithm only requires function evaluations on top of the exact line search 
oracle. Note that an approximate line search oracle can efficiently be imple- 
mented with function evaluations and binary search. No additional first or 
second-order information about the objective is needed. Besides the starting 
point no further input parameters describing function properties (such as cur- 
vature constant etc.) are necessary. The actual run time will, however, depend 
on the specific properties of the objective function. 

Variable Metric Random Pursuit (V-RP) comprises an independent process 
that gives an approximation of the Hessian at each iteration. The inverse of the 
Hessian is then used as covariance matrix in the multivariate normal distribu- 
tion to generate the current search direction. In principle, any deterministic or 
randomized gradient-free estimator can be used for this purpose. In Section[5]we 
present two variants of a Randomized Hessian approximation scheme recently 
proposed in [12] for which theoretical guarantees are known. 
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3 Definitions and Notations 



3.1 Quadratic norms 

Let PD„ denote the set of symmetric positive definite n x n matrices. With 
respect to A G PD„, we can define an 'anisotropic' inner product and a corre- 
sponding norm by 

(x,y)^ y^^x, and |lx|l^ := (x,x)^ , 

for X, y e R". In statistics the induced metric is also known as the Mahalanobis 
metric. We observe that 

A„.i„(^)llxf < 11x11^ <A„,,.(A)|lxf, (5) 

due to Amin(^) = minjx-^Ax : = 1} and Ainax(^) — max{x-^^x : — 1}. 
We wiU frequently need the following lemma. 

Lemma 3. Let A, B € PD„,x e R" with x 7^ 0. Then 

A,„i„(i?-M) < < A„,ax(i?-M) . 

Mb 

Proof. Suppose that B — CC^ and set y — C'^x. Then we have 

M\a ^ llyllc-Mc-^ 
Mb \\y\\ ' 

where C^^ is a shorthand for (C"^)^^. Using ([s]), we can argue that 

It remains to show that the matrices C^^AC^'^ and C^^C^^A — B^^A have 
the same eigenvalues. This follows from the "Rotation" Lemma |4j □ 

Lemma 4 (Rotation). Let L e R"^", and let C G R"^" be an invertihle 
matrix. Then the two matrices 

P := LCC'^ and Q := C'^LC 

have the same eigenvalues. 

Proof. We show that P and Q have the same characteristic polynomial. For 
this, we first observe that 

P~tIn = C-^iQ-tIn)C^. 

It follows that 

det(P - tin) = det{C-^{Q - tIn)C^) = det(C~^) det(C'^) det(Q - t/„) 

= det(Q - tin) ■ □ 
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3.2 Quadratic bounds 

We now introduce some important inequalities that are useful for the subsequent 
analysis. We always assume that the objective function / is differentiable and 
convex. The latter property is equivalent to 

/(y)>/(x) + (V/(x),y-x), x,yell". (6) 

We also require that the curvature of / is bounded. However, we allow for 
different curvatures depending on the direction. By this we mean that for some 
fixed symmetric and positive definite matrix Li G PD„, 

/(y)-/(x)-(V/(x),y-x) < i||x-y||i^, x,yeR". (7) 

We will also refer to this inequality as the (matrix) quadratic upper bound. It 
means that the deviation of / from any of its linear approximations can be 
bounded by a quadratic function. 

A differentiable function is strongly convex with parameter M £ PD„ if the 
(matrix) quadratic lower bound 

/(y)-/(x)-(V/(x),y-x) > ^lly-xf,,, , x,y€R", (8) 

holds. Let x* be the unique minimizer of a strongly convex function / with 
parameter M. Then equation ([8| implies this useful relation: 

i||x-x*||2,,</(x)-/(x*)<i||V/(x)|p,,_,, VxeR". (9) 

The former inequality uses V/(x*) — 0, and the latter one follows from ([s]) via 

/(x*) > /(x) + (V/(x),x* - x) + i ||x* - 

> /(x) + min (^(V/(x),y - x) + ^ ||y - x||^,^ = /(x) - ^ || V/(x)||i,_, 

by standard calculus. 

3.3 Expectations involving normally distributed random 
variables 

We here review and derive certain facts about the moments of the standard 
normal distribution for the later convergence analysis. 

Lemma 5. Let u e A/'(0, S) be drawn from the multivariate normal distribution 
over R" with covariance E G PD„. 

(i) Then for all indices i,j, fc, 

E[u,] = , E[u^UJ] = , E [uu^] = S , 

and 

E[uiUjUkUi] — J^ij'Ekl + ^ik'^jl + . 
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(ii) Let A S SYM„ be a symmetric n x n matrix. Then 

E [u^Au] = Tr[^S] , E [u^Au • uu^] = Tr[AE]E + . 

Proof. The first three equahties in part (i) are easy consequences of the above 
definition. The last one is known as Is.serlis' Theorem [9]. 
The matrix equations in (ii) easily follow from (i) via 

n n 

E [u^Au] = E[u,UjA,j] = = Tr[A^S] , 



and 



(E [u^Au ■ uu"^]) ^ E [uiUjUkUiAki] 



k,l = l 



□ 



= Tr[AE]E,, + (SAS)^^. + (SAS)^.^, 

using symmetry of SAE. 
Lemma 6. Let v ~ S""^^ a random unit vector and let A e SYM„. Then 
rn. \ T A Ti Tr A /„ + 2A 

E V Aw ■ VV = ; r . 

L ^ n{n + 2) 

Proof. Let u ~ Af{0,ln). We observe that the random vector w = u/ |lu|| has 
the same distribution as v. In particular 



Ev [v^^v • vv"^] = E„ 



u^ylu • uu^ 



Eu [u^Au ■ uu^] _ Tr[A]/„ + 2A 



where the second equality follows from independence of 
Using Lemma [5] (ii) with A = E = /„, we get 



u Au uu 



E 



Tr [E [u^u • uu^]] = n{n + 2) 



n(n + 2) 



and lluf i. 



□ 



We finally need the following lemma on the expectation of certain scalar 
products: 

Lemma 7. Let u e Af(0, E) be drawn from the multivariate normal distribution 
over R" with covariance E G PD„, let A e PD„ and x € R". Then 



(X, u) u|| 



= Tr[^E] ||x||^ +2||x"^ 



E [(x, u) u] ~ Ex , and E 

Proof. We calculate 

E [(x, u) u] = E [uu^x] = E [uu'^] x = Ex , 
by Lemma [5] (i) . For the second moment we get 



E 



(x,u) u|| 



= E[x^uu^Auu^x] = x^E[uu^^uu^]x, 



and the claim follows by Lemma [s] (ii), observing that u • u^Au ■ ~ u-^^u ■ 



□ 
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4 Convergence of Fixed Metric Random Pursuit 

To prepare the convergence proof of Algorithm F-RP, we study the expected 
progress in a single step, which is the quantity 

/(xfc)-E[/(x,+i) |xfc] . 

We will now derive the global convergence rates for convex and strongly convex 
functions. 



4.1 Single step progress 

Once a search direction is determined, the subsequent iterate is chosen according 
to Equation [4] Because the step size is determined by the line search oracle, we 
can derive the following lower bound on the single step progress. 

Lemma 8. Let f be convex with quadratic upper bound Li e PD„ , x G R" and 
x+ be the next iterate after one step of Algorithm F-RP in direction u e ]R"\{0}, 
cf. Equation^ Then 

/(x+)</(x)+t(V/(x),u) + ^|lu||i^ , (10) 

for every i G R. 

Proof. This is a simple consequence of Definition [2j For every t it holds 

/(x+) =/(x + LS;(x,u)-u) 
</(x + tu), 

and the claim follows from the quadratic upper bound ([7| with y = x + tu. □ 

In order to get a bound on the expected single step progress, let us fix 
a (possibly suboptimal) step size t — — ft,(V/(x),u) with (yet to determine) 
parameter ft, G R. With this step size the single step progress can be calculated. 

Lemma 9. Let f we convex with quadratic upper bound Li G PD„, x G R" 
and x+ be the next iterate after one step of Algorithm F-RP with direction 
u^A/'(0, S), eg. Equation^ Then 

E[/(x+) |x] </(x)-ft||V/(x)||^ (11) 
+ '^(Tr[Liq||V/(x)||^ + 2||V/(x)||2^^^^) , (12) 

for every /i G R. 



Proof Let t = -h (V/(x), u) in Then 

E [/(x+) I x] < /(x) - hE [(V/(x), (V/(x), u) • u) 



— E 11""^^--^ -11^ 
2 



(V/(x),u).u||^^ 
-/(x)-ft(V/(x),SV/(x)) 

+ y (Tr[LiE]||V/(x)||^ + 2||V/(x)||L^^ 

with Lemma [T] □ 
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Now we are ready to eliminate the last free parameter h in our bound on the 
single step progress. The best choice of the parameter is obviously the one that 
minimizes the right hand side of (11 1. By taking the derivative with respect to 
h, we see that h must satisfy 



- ||V/(x)||^ + h [Tr [LiE] ||V/(x)||^ + 2 ||V/(x)||^^^5,j = . 

With this choice of h, we readily obtain our final bound on the single step 
progress. 

Lemma 10. Let f he convex with quadratic upper hound Li G PD„, x G R" 
such that V/(x) ^ 0, and let X-|_ be the next iterate after one step of Algorithm 
(V)RP with direction A/'(0, S), cf. Equation^ Then 



1 



E [/(x+) I x] < /(x) - , ||V/(x)||^ , (13) 

where (t(x) : 



2Tr[LiE] +4cr(x) " ' ■> ^'^"^^ ' 

l|V/(x)||^^,^ 



l|V/(x)||^ ■ 

This lemma shows that, on average, there is progress in every single step 
if |lV/(x)||j^ is bounded away from zero Q This can be assured for all strongly 
convex functions, but not for convex functions in general. For this reason, we 
also derive a slightly different bound on the one-step progress. 

Lemma 11. Let f he convex with quadratic upper hound Li G PD„, x G R" 
and x_|_ be the next iterate after one step of Algorithm F-RP with direction 
u ~ A/'(0, S) (see Eq. In addition, let x* G R" he one of the minimizers of 
f. Then, for every positive h > it holds that 

E[/(x+)-/(x*)|x]<(l-;i) (/(x)-/(x*)) 



fTrim , \ „ (14) 

'^max(,-t^l ) 



2 \Aniin(S) 



Proof. Let t = — /i^E ^(x — x*),u) in ( [To| . Then, similarly to the proof of 
Lemma [9] we derive 

E[/(x+) |x] </(x)-/i(V/(x),x-x*) 

,2 (15) 
+ y (Tr[iiS]||x-x*||^_,+2||x-x*||iJ , 

with Lemma [t] Using the definition of convexity ^ we can bound the term 
(V/(x),x — X*) from below by /(x) — /(x*). By subtracting /(x*) on both 
sides of (fTsl) we thus arrive at 



E [/(x+) - /(x*) I x] < (1 - ;^) (/(x) - /(x*)) 

+ y (Tr[LiS]l|x-x*||^_,+2||x-x*||iJ . ^^^^ 

By bounding the last two terms from above with ([s]), we obtain (14), using 
A„>ax(S-^) = l/A„,i„(E). □ 



^Here we use that Tr [LiS] > 0; see the remark after Theorem |l3l 
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4.2 Global convergence 

We now use the previously derived bounds on the expected single step progress 
(Lemma 10 and Lemma 111 to show convergence of F-RP in expectation. We 
first show convergence on smooth but not necessarily strongly convex functions. 

Theorem 12. Let f be convex with quadratic upper bound Li G PD„ , let x* e 
R" be a minimizer of f and let the sequence {xfc}fe>o be generated by Algorithm 
F-RP with covariance S € PD„. Assume there exists R € M,, s.t. ||y — xo|| < i? 
for all y G K." with /(y) < /(xq). Then, for any N >0, we have 

E[/(xA.)-/(x*)]<^, (17) 

where 

Q:=max{2c^(Li,S),/(xo)-/(x*)} , c^(Li, E) ^^^^ + 2A,nax(Li) • 
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Proof. By assumption, ||xj. — x*|| < R for all fc = 0, 1, . . . , A^. With Lemma 
it follows for any step size hk > 0: 

E [/(x,+0 - /(x*) I Xfc] < (1 - h,) (/(x,) - /(x*)) + ;,2 ^Mii^ (^8) 

Taking expectations over x^, we obtain 

E [/(xfc+i) - /(x*)] < (1 - h,)E[f{^,) - /(x*)] + ;,2^RMii^ (^9) 

As in [m Theorem 5.3], the choice hk := ^ for fc = 0, 1, . . . , (iV - 1) yields 

E[/(x.)-/(x*)]<^.^!^^^. □ 

On strongly convex function the convergence of F-RP is linear. 

Theorem 13. Let f be convex with quadratic upper bound Li G PD„, and 
let f in addition be strongly convex with parameter M G PD„. Let x* G R" 
denote the unique minimizer of f and let the sequence {xk}k>o be generated by 
Algorithm F-RP with covariance E G PD„. Then 

E [/(x„) - < (l - jrjjjfff;^) " ■ (/(X.) - /(X)) . (20) 

Proof. We use Lemma [3] to bound 

||V/(Xfc)|||^^5. _^ 

O'(Xfc) = — 2^ - •^max(S ELiE) = Ainax(ilS) 

l|V/(Xfc)||5. 

for fc = 0, 1, . . . , TV - 1. Thus Lemma [lO] yields 

E [/(x.,0 I X.] < /(X.) - ( ,T..[X,q+4A...(L,E) ) ' ^''^ 
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for fc = 0, . . . , — 1. Using Lemma [3] again, we get 

||V/(x,)||2 > A,„in(A/S)||V/(xfc)||2,_i . 

Applying the quadratic lower bound (|9| to further bound the latter term from 
below yields 

||V/(xfe)||^ > 2A,„in(A/S) (/(x,) - /(x*)) . 
Now, plugging this into ( |2l| ) and subtracting /(x*) on both sides yields 

E [/(x.„) - I xd < (i - ^siiT^rarsj) ' - • 

for fc = 0, . . . , iV — 1. Finally, taking expectation (over x^) yields the recurrence 
E [/(X.,,) - /(X*)] < (l - ^[^^^:^^(^^^) ) • E [/(X.) - /(X*)] , 

and the theorem follows. □ 

We remark that the progress is strict: With E = CC'^, "Rotation" Lemma|4] 
in Section [3] along with Sylvester's law of inertia yields that all the three terms 
Amin(AjfS),Tr[LiE] and Amax(iiS) are positive. 

It is not necessary that the function / is strongly convex everywhere for 
linear convergence to hold. Let us recall that strong convexity with parameter 
M implies that 



/(x) - /(x*) > ^ ||x - ,Vx e R" . (22) 



It turns out that, instead of strong convexity ([8|), the weaker condition (22) is 
enough for linear convergence. 

Theorem 14. Let f be convex with quadratic upper bound Li G PD„, and 
let f have unique minimizer x* € R" satisfying ( 22 ) with M G PD„ . Let the 
sequence {yik}k>o &e generated by Algorithm F-RP with covariance E G PD„. 
Then 

E [/(x^.) - /(x*)] <{l-l^'' . (/(xo) - /(x*)) , (23) 



where 



Proof. To see this, we use ( |16[ ) to get 

E [/(x+) - /(x*) I x] < (1 - /i) (/(x) - /(x*)) 



Tr [LiE] 2 

(ME) A„,i„(A/ir') 



-(Tr[LiE]||x-x*f^_, +2||x-x*||i^ 



< {l-h + h'9\{f{^)-f{^*)) 
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where we used Lemma [3] to bound 



|x-x*||?,_i < llx-x*"^ 



A,nin(AfE) 

and 

||x-x*||i^<|lx-x*|lL- ^ 



followed by 

Setting h to ^ the term in the left bracket becomes (l — 3^) and the proof 
continues as the proof of Theorem [13] □ 



5 Metric Learning in Random Pursuit 

Our bounds on the progress rate of F-RP with fixed covariance matrix S con- 
cisely describe the influence of the matrix upper and lower bounds and the 
chosen covariance matrix on the convergence rate of RP algorithms. For in- 
stance, for the special case of quadratic functions of the form /(x) = |x^i/x 
with H G PD„ and trivial quadratic upper and lower bound Li = M = H, th e 

choice S = H^^ leads to the expected progress rate (1 — -^^] (cf. Thm 13 1. 



This rate is (i) near-optimal from a theoretical point of view |10| and (ii) inde- 
pendent of the function / (i.e., the spectrum of H). For general strongly convex 
functions /: E," — > R, the Hessian V^/(x) is not constant for all x G R". 
However, if we assume that the Hessian is only mildly changing (for instance, 
Lipschitz continuous) then for all x g R" that are close to the unique min- 
imizer x* G R" the corresponding Hessians will also be close, meaning that 
V^/(x) « V^/(x*) in some norm. The choice E"^ = V^/(x*) is thus likely to 
also yield a good convergence rate on this function class. 

We are now left with the challenge of how to efficiently learn a suitable covari- 
ance matrix (that induces the right metric) on smooth convex functions in the 
present gradient-free setting. Iterative stochastic covariance matrix adaptation 
schemes are well-established in gradient-free continuous optimization [TTl [31 [T3] 
but notoriously difficult to study theoretically. A welcome alternative has re- 
cently been introduced by Leventhal and Lewis |12] in form of a Randomized 
Hessian approximation scheme. We here review and extend their scheme which 
we refer to as Variable Metric (VM) update. For twice differentiable functions 
/: R" — >■ R, x G R" and initial iterate Bq G PD„, Leventhal and Lewis al- 
ready showed that the VM scheme generates a random sequence {Bk}k£K of 
iterates that converge to V^/(x). For quadratic functions, their analysis also re- 
veals exact rates for the convergence in expectation of the sequence of estimates 
{Bk\k&m to the true Hessian H. We are, however, interested in the inverses 
of these matrices, since we want to understand convergence of the covariances 
S = SjT^, fc G IN to the optimal covariance . We know that for a sufficiently 
large number K of steps, B^ « H^^ almost surely, but explicit bounds for K 
do not directly follow from existing results. 

We here address this question and provide, in addition, two novel, theoreti- 
cally sound, and easy-to-implement VM updates. We also study their numerical 
performance on Nesterov's worst case function and quadratic functions with (in- 
verse) sigmoidal spectral distribution. For simplicity, we analyze VM updates 
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and their interplay with Random Pursuit solely on convex quadratic functions. 
The general convex case is subject of future research. 



5.1 Variable Metric update schemes 

The generic VM update [12] comprises direct updates of a Hessian estimate. 
Given a symmetric matrix B G PD„ as current Hessian estimate, a direction u 
is selected uniformly at random from the n-dimensional hypersphere S"~^ (i-e., 
u ^ 5"^^). The next iterate B+ is determined according to: 

B+ = B + {H - B)u- uu^ . (24) 

This formula requires the evaluation of Hu with unknown H. For twice 
differentiable functions / the second derivative of / at x in direction u can be 
well approximated by finite differences: 

u-Hn . /(x + ^u)-2/(x) + /(x-eu) ^^^^ 

for some small e > 0. In the convex quadratic case, the above formula is exact 
for arbitrary e > 0. Note that this formula only requires two additional function 
evaluations at x±eu. In addition, the formula implies that the estimate i3+ be- 
haves at X like the unknown Hessian along direction u, that is, u^B+u = u^Hu. 



This can be seen directly from (24) by noting that u uu u = 1. Unfortunately, 
the update does not guarantee that the matrix B+ stays positive definite. In 
order to be useful in Variable Metric Random Pursuit, an additional correc- 
tion step is thus required. Leventhal and Lewis suggest an ad hoc projection 
of _B+ onto the cone of PD„ matrices. They numerically show that this yields 
a practicable algorithm |12j . We here introduce two alternatives, updateHess 
and updateHessCorr, as outlined in Fig. [2] 



updateHess(/, x, B, e) 



Requires: Global variable T, 
initialized at first 
invocation to T = B 

Output : Hessian estimate 
B+ e PD„ 

1 u ~ S"-^ 

2 A, 



T 



3 if T 

4 I B+ 
else 

5 [_B+ ^ B 

6 return B 



/(x+eu)-2/(x)+/(x-eu) _ To 

- T -I- Au • uu^ e PD„ then 



+ 



updateHessCorr(/, x, B, e) 
Output; Hessian estimate 
B+ e PD„ 



1 U 

2 A„ 

3 if T 

4 I B+ 
else 

V -ir 
A^ 



/(x+eu)-2/(x)+/(x-tu) _ jjT^^ 

B + Au • uu^ e PD„ then 

^ T 

smallestEVec(r) 



/(x + ev)-2/(x) + /(x-6v) _ ^Trp^ 



S+ ^ (B + Av • vv^ 
8 return B^ 



■A, 



Figure 2: Two implementations of the VM update scheme (24 1. Left panel: The 
update ( 24 1 is applied to a temporary matrix T and the matrix B is only updated if 



T is positive definite. Right panel: The matrix B is updated in every step. Positive 
definiteness is established by an additional correction step (see main text for further 
information) . 
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In sub-routine updateHess the current matrix T is returned if it is positive 
definite. Otherwise, the last known positive definite matrix is used. As long as 
the iterates are positive definite, no additional computational effort is needed. 
The update can be implemented in Oin?) by using a rank-one update on the 
Cholesky decomposition of T. However, if T is not positive semidefinite this 
approach fails, and the condition on line 3 of the algorithm must be checked by 
computation of the smallest eigenvalue. 

In sub-routine updateHessCorr we ensure that the generated iterates are 
always positive definite. In case is not positive definite, we apply a second 
VM update step in direction v, where v is an eigenvector of corresponding 
to the smallest (hence negative) eigenvalue of B^. By standard matrix pertur- 
bation theory, as detailed in Lemma [15] below, the twice updated matrix will be 
positive semidefinite definite again (as H is). This scheme comes at the expense 
of two additional function evaluations at x± ev. This version of the VM update 
has already been successfully used in a recent numerical study |18j . 

Lemma 15. Let A e PD„, x G R" and zi G R" an eigenvector corresponding 
to the smallest eigenvalue of {A — xx^). Then 

B := A- XX^ + |Amin(A - xx^)| zizf G PD„ . 

Proof. The matrix (^4 — xx-^) is symmetric. Let [A — xx-^) = X^ILi ^i^i'^f 
denote its spectral decomposition with Ai < A2 < . . . A„ in increasing order. If 
Ai > 0, then there is nothing to show. Otherwise, we observe that by a variant 
of Weyl's theorem (cf. ^7", Theorem 4.3.4]), < X,{A) < Xi+i^A - xx^) = A,+i 
for i = 1, . . . , 71 — 1. Thus at most Ai can be negative. We conclude 



All zizf y > (Aizizf + |Ai| Zizf ) y > , 



for all y G R". □ 

Remark 16. The present VM update schemes can be used in several ways in 
combination with Random Pursuit: (i) One can use the VM scheme at the 
initial iterate xq G R" multiple times in order to get a good approximation B 
of the Hessian V^/(xo) and then employ S = B^^ in F-RP; (ii) one could 
continuously toggle between VM update steps and line searches (as shown in 
Fig. [7J right panel), yielding a fully adaptive scheme. The analysis of the latter 
combination is naturally more evolved. On quadratic functions, however, the 
VM update is independent of the current position x G R" and thus, V-RP is 
amenable to a theoretical analysis. Finally, note also that the update schemes 
are easily parallelizable due to the independence of the directions u and the 
calculation of uHu. This can be used on today's multi-core machines to speed- 
up the algorithm. 




5.2 Convergence of Variable Metric Random Pursuit 

We now show that the combination of RP with sequential VM updates indeed 
yields a convergent algorithm on convex quadratic functions. To simplify the 

notation, we write the VM update in terms of the error matrix E = (B — H). 
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Definition 17 (VM update scheme). Let Eq G R"^" be a symmetric matrix 
and {ufc}fe>o a sequence of independent vectors sampled uniformly from 5". 
Then the sequence {Ek}k>o is generated by the VM update if the iterates satisfy 



Ek+i = Ek — EkUk ■ UfeUj, 



(26) 



We observe that this definition matches with ( 24 ) for E/^ — Bj^. — H 



The following Lemma summarizes the most important properties of the VM 
update scheme. 



Lemma 18. Let {Ek}k>o be generated by the VM update (26) with Eq sym 



metric and let K =^ nln + 2) ln(a\/6) for parameters a > 1, 6 > 1. Then 
(i) \\Ek\\p<\\Ek-i\\p, fork>l, 



(it) E 



\E, 



(ni) \\Ek\ 



< 



< I- 



)k 2 
\\Eo\\f fork> 0, and 



l-^ol 



for all k > K , with probability at least 1 



Corollary 19. Let {-Bfc}fe>o with Bq symmetric a sequence of iterates generated 
either by the VM update as implemented in updateHessCorr or the sequence of 
internal matrices T in updateHess. Then the statement from Lemma 18 also 
holds for the sequence of error matrices {E'^.}k>o {Bk ~ H}^ 



tk>0- 



Proof. The internal matrices T in updateHess are exactly updated according 



to (26 1, hence nothing is to show. The iterates of updateHessCorr are almost 



generated according to ( 26 1 . The additional correction step (if necessary) can be 
viewed as one step of the VM update ( 24 ) in a special (not random) direction. 



However, by (i) of Lemma 18 the Frobenius norm of the error matrix will not 
increase by this step. □ 

of Lemma \18\ We note that (i) and (ii) were already proven by Leventhal and 
Lewis [12, Theorem 2.1]. It remains to show that (iii) follows from Markov's 
inequality. Since decreases monotonically, it suffices to prove (iii) for 

k — K. We have 



1 



n{n + 2) 



K 



< 



1 



{aVby 



hence, by (ii), 



n\\EK\\i] < 



1 WE, 



oil F 



By the Markov inequality, the probability that ||£';f ||p exceeds its expectation 
by more than a factor of b is at most 1 /b, and this yields the lemma. □ 

In order to show the effectiveness of the V-RP algorithm we have to argue 
that the convergence factor in Theorem [13] describing the single-step progress 
of F-RP converges to 1 — l/(n + 2). To provide some intuition on the con- 
vergence behavior, let us consider a quadratic function /(x) = ^x^i/x, with 
quadrati c up per and lower bound M — Li — H. Then the convergence factor in 
Theorem 13 depends only on the spectrum of HT,, with S = B^T^ = {H + Ek)~^. 
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C 



Lemma 20 (VM convergence factor). Let H € PD„, let {Ek}k>Q with Eq 
symmetric be a sequence of iterates generated by the VM update (26) on the 
function /(x) = ^x^iJx and let c < 1. If for k' > 0; 

\Ek'L < 

Il^-^ll2 

then the convergence factor from Theorem \l3\ can be upper bounded by 

_ A^in(gi]) ^ _ (1 - cy 

Tr[i/S] + 2A,„ax(i/S) - n + 2 ' ^ ^ 

for = {H + Ek')-\ 

Proof. To show the Lemma, we derive bounds on the smallest and largest eigen- 
values of the matrix iJE. First we observe that 

msi^{\X^ir,iEk'H-')\,\X^,^{Ek'H-')\} = \\Ek'H-% < WEk'^H'^ 

by the definition of the 2-norm and submultiplicativity. With the assumption 
on the product of the two norms, the whole expression can be upper bounded 
by c: 



\Ek'\\^\\H-'\\^<c<l. 



Therefore, the largest eigenvalue of HY, is well defined and finite: 

A,nax(i?S) = XnUiHY)-') ^ Xminiln + Ek' R-^) 

1 1 

< 



l + XminiEk'H-^) - 1 

With a similar argumentation we obtain a lower bound on the smallest eigen- 
value of 

A„,i„(ffS) = — ] > 1 - A,nax(^fe'ff~') > 1 - C, 

where the first inequality follows from >1 — a;fora::>— 1. These two 
bounds together with the trivial estimate TrfiJS] < nAmax(^^S) yield: 



XminiHY) ^ A,nin(gS]) ^ (1 - c) 



2 



Tr[HS] + 2An,ax(ifS) - (?^-f 2)A„,ax(i?S) - (n + 2) ' 
and the claim follows. □ 

Corollary 21. Let c < 1 and K = n{n+2)\n{\\H-^\\^\\Eo\\p Vb/c). Then ^ 
holds with probability at least 1 — ^■ 

Proof. We observe that ||i?fc||2 < ll-E-fclli?- Now the result follows from Lemma 



and Lemma 



18 



20 



(iii) with a = ||£;o|If/c. □ 



We can also interpret the statement of Corollary [21] as follows: By solving 
the equation for c and plugging the result into (27 1, we obtain an upper bound 
on the convergence factor depending only on b, the success parameter, and K, 
the number of iterations. With probability at least 1 — | it holds: 

c<\\H-^\\^\\Eo\\pVbe-^^ . (28) 

In summary, these results guarantee the near-optimal linear convergence of 
V-RP after an initial learning phase of at most K steps with high probability. 
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5.3 Illustrative numerical examples 



We now illustrate the typical convergence behavior of Variable Metric Random 
Pursuit on three instances of convex quadratic functions. The first instance is 
Nesterov's worst case function /ncs: 



/nos(x) 



n-1 



Xi 



m 2 

y 11-11 



We here use this strongly convex function with parameter m = 1 and I = 10^. 
Further information about this function as well as extensive numerical results 
of the convergence of standard RP on this function can be found in [T5] . 



^ 10^° 
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Figure 3: Convergence of V-RP with updateHessCorr on /nos- Upper panel: Mean 
(blue) and max/min (grey) function values vs. # ITS over 111 runs. Lower Panel: 
Mean (blue) and max/min (grey) convergence factor (see Thm. 131 vs. # ITS over 
111 runs. Theoretical upper bound (red) on the convergence factor. See main text for 
further information. 




5 10 15 20 



Figure 4: Convergence of V-RP with updateHess on /ncs- Upper panel: Mean (blue) 
and max/min (grey) function values vs. # ITS over 111 runs. Lower Panel: Mean 
(blue) and max/min (grey) convergence factor vs. # ITS over 111 runs. See main 
text for further information. 
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#ITS / 

Figure 5: Convergence of V-RP with updateHessCorr on /piat- Upper panel: Mean 
(blue) and max/min (grey) function values vs. # ITS over 111 runs. Lower Panel: 
Mean (blue) and max/min (grey) convergence factor (see Thm. |13[ ) vs. # ITS over 
111 runs. See main text for further information. 
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Figure 6: Convergence of V-RP with updateHess on /piat. Upper panel: Mean (blue) 
and max/min (grey) function values vs. # ITS over 111 runs. Lower Panel: Mean 



(blue) and max/min (grey) convergence factor (see Thm. 131 vs. # ITS over 111 runs. 
See main text for further information. 



In addition, we test V-RP on two quadratic functions with sigmoidal and 
inverse sigmoidal (almost flat) distribution of eigenvalues, referred to as /sigm 
and /piat, respectively. These functions have been introduced in and are 
defined as 

/sig,n(x) = X^mn, ((l + e^^-^)"' + {x, if . 

and 

/Flat(x) = E^nx, (^-log (^(^10-« + 

with the normalizing function nmi{f{t)) = ^^Tjjfu + with £ = lO"^ . 



- 1 
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Figure 7: Convergence of V-RP with updateHessCorr on /sigm- Upper panel: Mean 
(blue) and max/min (grey) function values vs. # ITS over 111 runs are reported. 
Lower Panel: Mean (blue) and max/min (grey) convergence factor (see Thm. |13[ ) vs. 
# ITS over 111 runs are reported. Theoretical upper bound (red) on the convergence 
factor. See main text for further information. 




Figure 8: Convergence of V-RP with updateHess on /sigm- Upper panel: Mean (blue) 
and max/min (grey) function values vs. # ITS over 111 runs are reported. Lower 
Panel: Mean (blue) and max/min (grey) convergence factor (see Thm. 13 1 vs. # ITS 
over 111 runs are reported. See main text for further information. 



For all considered quadratic functions the ratio of largest to smallest eigen- 
value of the Hessians (i.e. the condition number) is 10^, and their global min- 
imum at X* = 1„ (where 1„ is the all-ones vector) with /(x*) = 0. On all 
functions we conduct 111 runs of V-RP in n = 50 dimeirsions. The initial con- 
ditions are Xq — 0, Bq — 0.5 • 10^ • /„. For comparison both VM update schemes 
(see Fig. [2]) are tested. We here report the evolution of the mean, maximum, and 
minimum function value vs. number of iterations {=ff ITS). We also calculate 
and report the derived convergence factor from Thm. |13[ We observe a number 
of interesting features of the numerical optimization trajectories. Independent 
of the concrete function instance we see that V-RP shows three distinct phases 
of convergence in function values: (i) a first short phase of rapid improvement, 
(ii) a metric learning phase with only marginal progress in function decrease. 
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and (iii) a final rapid decrease in function value. For the considered functions 
this last phase appears after 10-15 iterations. We also observe that, on aver- 
age, the theoretically derived convergence factor indeed captures the on-set of 
the fast convergence of V-RP. Both VM update schemes have a similar effect 
on convergence, with updateHess producing a more discontinuous trajectory 
of function values. This is expected because in V-RP with updateHess the 
employed metric stays fixed over longer periods of iterations (until the Hessian 
estimate becomes positive definite again). On /ncs the numerical data suggest 
that updateHess is preferred over updateHessCorr because the former strategy 
consistently needs less iterations (both on average and on the tails) to reach 
the global minimum (see Figs, [s] and |4j respectively). This can also be ob- 
served on /piat (see Figs. |5]and|6|. On /sigm, however, the V-RP scheme 
with updateHessCorr is superior (see Figs, [t] and [s] respectively). Overall, the 
numerical results show the clear advantage of learning a variable metric. Non- 
adaptive schemes such as standard RP or Nesterov's randomized gradient-free 
schemes would not even come close to the achieved solution accuracy within the 
considered function evaluation budget [121 . 

In Figs. 3]and[7]we show the upper bound on the convergence factor obtained 
in Lem. [20 and Cor. [2l] with parameter 6 = 2 (see also the discussion at the 
end of Sec. |5.2[ ). In general, we expect the bound to be loose because the 
derivation in both Lem. [20] and Cor. [21] are based on worst-case assumptions 
that do not reflect the average situation. We indeed observe that the bound is 
very conservative on Nesterov's function and does over-estimate the observed 
convergence factors. On function /sigm, however, the bound is much closer to the 
observed convergence factors. This is in full agreement with previous numerical 
studies [18] that identified /sigm as a challenging function for adaptive schemes. 
The presented experiments suggest that the convergence factor, derived in Thm. 
|13[ is an interesting quantity that deserves further theoretical investigation. 



6 Discussion and Conclusion 

In this contribution we have analyzed Random Pursuit algorithms that employ 
(i) a fixed but arbitrary metric (Fixed Metric Random Pursuit) and (ii) a vari- 
able metric learning procedure (Variable Metric Random Pursuit). We have 
detailed convergence proofs and convergence rates for these Random Pursuit 
algorithms on convex functions. We have used an improved (matrix) quadratic 
upper bound technique to show expected single-step progress and global conver- 
gence of Fixed Metric Random Pursuit on (strictly) convex functions. We have 
also shown that Variable Metric Random Pursuit can achieve a near-optimal 
convergence rate on convex quadratic functions that, after a finite learning phase 
of length at most O(n^), does not depend on the underlying properties of the un- 
known Hessian of the function. Compared to standard Random Pursuit [19) we 
have thus removed one of our previously identified challenges toward the design 
of competitive gradient-free optimization methods that are easy to implement, 
possess theoretical convergence guarantees, and are useful in practice 

Nonetheless, a number of theoretical challenges remain. Firstly, it would 
be very interesting to give tighter upper and lower bounds on the expected 
convergence factor for Variable Metric Random Pursuit. This is subject to 
further research, although some first results have already been obtained in [50] . 
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Secondly, it is still an open question how to analyze Random Pursuit schemes 
for constrained optimization problems of the form 

min/(x) subject to x £ JC , (29) 

where /C C M" is a convex set. Furthermore, it is open how to derive conver- 
gence guarantees for Random Pursuit schemes on the class of globally convex (or 
(5-convex) functions [8 , or on noisy functions with certain bounds on the vari- 
ance of the noise. Finally, convergence on the important class of non-smooth 
convex functions is another fundamental challenge for gradient-free optimiza- 
tion that, most likely, needs novel tools and techniques to be developed by the 
mathematical programming community. 
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